Abstract

We consider a stochastic bandit problem with infinitely many arms. In this setting, the learner has no chance of trying all the arms even once and has to dedicate its limited number of samples only to a certain number of arms. All previous algorithms for this setting were designed for minimizing the cumulative regret of the learner.

In this paper, we propose an algorithm aiming at minimizing the simple regret. As in the cumulative regret setting of infinitely many armed bandits, the rate of the simple regret will depend on a parameter $\beta$ characterizing the distribution of the near-optimal arms. We prove that depending on $\beta$, our algorithm is minimax optimal either up to a multiplicative constant or up to a $\log(n)$ factor.

We also provide extensions to several important cases: when $\beta$ is unknown, in a natural setting where the near-optimal arms have a small variance, and in the case of unknown time horizon.

1. Introduction 

Sequential decision making has been recently fueled by several industrial applications, e.g., advertisement and recommendation systems. In many of these situations, the learner is faced with a large number of possible actions, among which it has to make a decision. The setting we consider is a direct extension of a classical decision-making setting in which we only receive feedback for the actions we choose: the bandit setting. In this setting, at each time $t$, the learner can choose among all the actions (called the arms) and receives a sample (reward) from the chosen action, which is typically a noisy characterization of the action. The learner performs $n$ such rounds and its performance is then evaluated with respect to some criterion, for instance the cumulative regret or the simple regret.

In the classical multi-armed bandit setting, the number of actions is assumed to be finite and small when compared to the number of decisions. In this paper, we consider an extension of this setting to infinitely many actions, the infinitely many armed bandits (Berry et al., 1997; Wang et al., 2008; Bonald & Proutiere, 2013). Inevitably, the sheer number of possible actions makes it impossible to try each of them even once. Such a setting is practically relevant for cases where one faces a finite but extremely large number of actions. This setting was first formalized by Berry et al. (1997) as follows. At each time $t$, the learner can either sample an arm (a distribution) that has been already observed in the past, or sample a new arm, whose mean $\mu$ is sampled from the mean reservoir distribution $\mathcal{L}$.

The additional challenges of the infinitely many armed bandits with respect to the multi-armed bandits come from two sources. First, we need to find a good arm among the sampled ones. Second, we need to sample (at least once) enough arms in order to have (at least once) a reasonably good one. These two difficulties call for a tradeoff, which we call the arm selection tradeoff. It is different from the known exploration/exploitation tradeoff and more linked to model selection principles: On one hand, we want to sample only from a small subsample of arms so that we can decide, with enough accuracy, which one is the best among them. On the other hand, we want to sample as many arms as possible in order to have a higher chance of sampling a good arm at least once. This tradeoff makes the problem of infinitely many armed bandits significantly different from the classical bandit problem.

Berry et al. (1997) provide asymptotic, minimax-optimal (up to a $\log n$ factor) bounds for the average cumulative regret, defined as the difference between $n$ times the highest possible value $\mu^*$ of the mean reservoir distribution and the mean of the sum of all samples that the learner collects. A follow-up on this result was the work of Wang et al. (2008), providing algorithms with finite-time regret bounds, and the work of Bonald & Proutiere (2013), giving an algorithm that is optimal with exact constants in a strictly more specific setting. In all of this prior work, the authors show that it is the shape of the arm reservoir distribution that characterizes the minimax-optimal rate of the average cumulative regret. Specifically, Berry et al. (1997) and Wang et al. (2008) assume that the mean reservoir distribution is such that, for a small $\varepsilon > 0$, locally around the best arm $\mu^*$, we have that

$\mathbb{P}_{\mu \sim \mathcal{L}}(\mu^* - \mu < \varepsilon) \approx \varepsilon^{\beta}$, (1)

that is, they assume that the mean reservoir distribution is $\beta$-regularly varying in $\mu^*$. When this assumption is satisfied with a known $\beta$, their algorithms achieve an expected cumulative regret of order

$\mathbb{E}[R_n] = O\left(\max\left(n^{\beta/(1+\beta)} \,\mathrm{polylog}\, n, \; \sqrt{n} \,\mathrm{polylog}\, n\right)\right)$. (2)

The limiting factor in the general setting is a $1/\sqrt{n}$ rate for estimating the mean of any of the arms with $n$ samples. This gives the $\sqrt{n}$ term in the rate (2). It can be refined if the distributions of the arms, which are sampled from the mean reservoir distribution, are Bernoulli with mean $\mu$ and $\mu^* = 1$ or, in the same spirit, if the distributions of the arms are defined on $[0,1]$ and $\mu^* = 1$, as

$\mathbb{E}[R_n] = O\left(n^{\beta/(1+\beta)} \,\mathrm{polylog}\, n\right)$. (3)

Bonald & Proutiere (2013) refine the result (3) even more by removing the polylog $n$ factor and proving upper and lower bounds that exactly match, even in terms of constants, for a specific sub-case of a uniform mean reservoir distribution. Notice that the rate (3) is faster than the more general rate (2). This comes from the fact that they assume that the variances of the arms decay with their quality, making finding a good arm easier. For both rates (2) and (3), $\beta$ is the key parameter for solving the arm selection tradeoff: with smaller $\beta$ it is more likely that the mean reservoir distribution outputs a high value, and therefore we need fewer arms for the optimal arm selection tradeoff.

Previous algorithms for this setting were designed for minimizing the cumulative regret of the learner, which optimizes the cumulative sum of the rewards. In this paper, we consider the problem of minimizing the simple regret. We want to select an optimal arm given the time horizon $n$. The simple regret is the difference between the mean of the arm that the learner selects at time $n$ and the highest possible mean $\mu^*$. The problem of minimizing the simple regret in a multi-armed bandit setting (with finitely many arms) has recently attracted significant attention (Even-Dar et al., 2006; Audibert et al., 2010; Kalyanakrishnan et al., 2012; Kaufmann & Kalyanakrishnan, 2013; Karnin et al., 2013; Gabillon et al., 2012; Jamieson et al., 2014) and algorithms have been developed either in the setting of a fixed budget, which aims at finding an optimal arm, or in the setting of a floating budget, which aims at finding an $\varepsilon$-optimal arm.


All prior work on simple regret considers a fixed number of arms that will ultimately all be explored, and cannot be applied to an infinitely many armed bandit or to a bandit problem with the number of arms larger than the available time budget. An example where efficient strategies for minimizing the simple regret of an infinitely many armed bandit are relevant is the search for a good biomarker in biology, a single feature that performs best on average (Hauskrecht et al., 2006). There can be so many possibilities that we cannot afford to even try each of them in a reasonable time. Our setting is then relevant for this special case of single feature selection. In this paper, we provide the following results for the simple regret of an infinitely many armed bandit, a problem that was not considered before.

• We propose an algorithm that for a fixed horizon $n$ achieves the finite-time simple regret rate

  $r_n = O\left(\max\left(n^{-1/2}, \; n^{-1/\beta} \,\mathrm{polylog}\, n\right)\right)$.

• We prove corresponding lower bounds for this infinitely many armed simple regret problem, which are matching up to a multiplicative constant for $\beta < 2$, and matching up to a polylog $n$ for $\beta \ge 2$.

• We provide three important extensions:

  - The first extension concerns the case where the distributions of the arms are defined on $[0,1]$ and where $\mu^* = 1$. In this case, replacing the Hoeffding bound in the confidence term of our algorithm by a Bernstein bound bounds the simple regret as

    $r_n = O\left(\max\left(\tfrac{1}{n} \,\mathrm{polylog}\, n, \; (n \log n)^{-1/\beta} \,\mathrm{polyloglog}\, n\right)\right)$.

  - The second extension treats unknown $\beta$. We prove that it is possible to estimate $\beta$ with enough precision, so that its knowledge is not necessary for implementing the algorithm. This can also be applied to the prior work (Berry et al., 1997; Wang et al., 2008), where $\beta$ is also necessary for implementation and optimal bounds.

  - Finally, in the third extension we make the algorithm anytime using known tools.

• We provide simple numerical simulations of our algorithm and compare it to infinitely many armed bandit algorithms optimizing cumulative regret and to multi-armed bandit algorithms optimizing simple regret.

Besides research on infinitely many armed bandits, there exist many other settings where the number of actions may be infinite. One class of examples is fixed design, such as linear bandits (Dani et al., 2008); other settings consider bandits in a known or unknown metric space (Kleinberg et al., 2008; Munos, 2014; Azar et al., 2014). These settings assume regularity properties that are very different from the properties assumed in the infinitely many armed bandits and give rise to significantly different approaches and results. Furthermore, in classic optimization settings, one assumes that in addition to the rewards, there is side information available through the position of the arms, combined with a smoothness assumption on the reward, which is much more restrictive. On the contrary, we only assume a bound on the proportion of near-optimal arms. It is not always the case that there is side information through a topology on the arms. In such cases, the infinitely many armed setting is applicable while optimization routines are not.

2. Setting 

Learning setting Let $\tilde{\mathcal{L}}$ be a distribution of distributions. We call $\tilde{\mathcal{L}}$ the arm reservoir distribution, i.e., the distribution from which the arms' reward distributions are drawn. Let $\mathcal{L}$ be the distribution of the means of the distributions output by $\tilde{\mathcal{L}}$, i.e., the mean reservoir distribution. Let $\mathcal{A}_t$ denote the changing set of $K_t$ arms at time $t$.

At each time $t + 1$, the learner can either choose an arm $k_{t+1}$ among the set of the $K_t$ arms $\mathcal{A}_t = \{\nu_1, \ldots, \nu_{K_t}\}$ that it has already observed (in this case, $K_{t+1} = K_t$ and $\mathcal{A}_{t+1} = \mathcal{A}_t$), or choose to get a sample of a new arm that is generated according to $\tilde{\mathcal{L}}$ (in this case, $K_{t+1} = K_t + 1$ and $\mathcal{A}_{t+1} = \mathcal{A}_t \cup \{\nu_{K_{t+1}}\}$ where $\nu_{K_{t+1}} \sim \tilde{\mathcal{L}}$). Let $\mu_i$ be the mean of arm $i$, i.e., the mean of distribution $\nu_i$ for $i \le K_t$. We assume that $\mu_i$ always exists.

In this setting, the learner observes a sample at each time. At the end of the horizon, which happens at a given time $n$, the learner has to output an arm $\hat{k} \le K_n$, and its performance is assessed by the simple regret

$r_n = \mu^* - \mu_{\hat{k}}$,

where $\mu^* = \arg\inf_m \left(\mathbb{P}_{\mu \sim \mathcal{L}}(\mu \le m) = 1\right)$ is the right end point of the domain.

Assumption on the samples The domain of the arm reservoir distribution $\tilde{\mathcal{L}}$ consists of distributions of arm samples. We assume that these distributions $\nu$ are bounded.

Assumption 1 (Bounded distributions in the domain of $\tilde{\mathcal{L}}$). Let $\nu$ be a distribution in the domain of $\tilde{\mathcal{L}}$. Then $\nu$ is a bounded distribution. Specifically, there exists a universal constant $C > 0$ such that the domain of $\nu$ is contained in $[-C, C]$.

This implies that the expectations of all distributions generated by $\tilde{\mathcal{L}}$ exist, are finite, and are bounded by $C$. In particular, this implies that

$\mu^* = \arg\inf_m \left(\mathbb{P}_{\mu \sim \mathcal{L}}(\mu \le m) = 1\right) < +\infty$,

which implies that the regret is well defined, and that the domain of $\mathcal{L}$ is bounded by $2C$. Note that all the results that we prove hold also for sub-Gaussian distributions $\nu$ and bounded $\mathcal{L}$. Furthermore, it would be possible to relax the sub-Gaussianity using different estimators recently developed for heavy-tailed distributions (Catoni, 2012).

Assumption on the arm reservoir distribution We now assume that the mean reservoir distribution $\mathcal{L}$ has a certain regularity in its right end point, which is a standard assumption for infinitely many armed bandits. Note that this implies that the distribution of the means of the arms is in the domain of attraction of a Weibull distribution, and that it is related to assuming that the distribution is $\beta$-regularly varying in its end point $\mu^*$.

Assumption 2 ($\beta$-regularity in $\mu^*$). Let $\beta > 0$. There exist $E, E' > 0$, and $0 < B < 1$ such that for any $0 \le \varepsilon \le B$,

$E' \varepsilon^{\beta} \ge \mathbb{P}_{\mu \sim \mathcal{L}}(\mu > \mu^* - \varepsilon) \ge E \varepsilon^{\beta}$.

This assumption is the same as the classical one (1). Standard bounded distributions satisfy Assumption 2 for a specific $\beta$, e.g., all the beta distributions, in particular the uniform distribution, etc.
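As an illustration of Assumption 2, consider a Beta(1, $\beta$) mean reservoir, for which $\mu^* = 1$ and $\mathbb{P}(\mu > 1 - \varepsilon) = \varepsilon^{\beta}$ exactly (so one can take $E = E' = 1$). The following is a minimal Monte Carlo sketch of that identity; the function name, sample size, and seed are illustrative choices and not part of the paper.

```python
import numpy as np

def near_optimal_mass(beta, eps, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(mu > mu* - eps) for mu ~ Beta(1, beta), mu* = 1."""
    rng = np.random.default_rng(seed)
    mu = rng.beta(1.0, beta, size=n_samples)  # means drawn from the reservoir
    return float(np.mean(mu > 1.0 - eps))

if __name__ == "__main__":
    eps = 0.2
    for beta in (0.5, 1.0, 2.0):
        # Empirical mass of eps-optimal arms next to the predicted eps**beta.
        print(beta, near_optimal_mass(beta, eps), eps ** beta)
```

Smaller $\beta$ puts more mass near $\mu^*$, which is why fewer arms need to be drawn from the reservoir in that regime.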

3. Main results 

In this section, we first present the information theoretic lower bounds for the infinitely many armed bandits with simple regret as the objective. We then present our algorithm and its analysis, proving the upper bounds that match the lower bounds, in some cases (depending on $\beta$) up to a polylog $n$ factor. This makes our algorithm (almost) minimax optimal. Finally, we provide three important extensions as corollaries.

3.1. Lower bounds 

The following theorem exhibits the information theoretic complexity of our problem and is proved in Appendix C. Note that the rates crucially depend on $\beta$.

Theorem 1 (Lower bounds). Let us write $\mathcal{S}_\beta$ for the set of distributions of arm distributions $\tilde{\mathcal{L}}$ that satisfy Assumptions 1 and 2 for the parameters $\beta, E, E', C$. Assume that $n$ is larger than a constant that depends on $\beta, E, E', B, C$. Depending on the value of $\beta$, we have the following results, for any algorithm $\mathcal{A}$, where $v$ is a small enough constant.

• Case $\beta < 2$: With probability larger than $1/3$,

  $\inf_{\mathcal{A}} \sup_{\tilde{\mathcal{L}} \in \mathcal{S}_\beta} r_n \ge v\, n^{-1/2}$.

• Case $\beta \ge 2$: With probability larger than $1/3$,

  $\inf_{\mathcal{A}} \sup_{\tilde{\mathcal{L}} \in \mathcal{S}_\beta} r_n \ge v\, n^{-1/\beta}$.

Remark 1. Comparing these results with the rates for the cumulative regret problem (2) from the prior work, one can notice that there are two regimes for the cumulative regret results. One regime is characterized by a rate of $\sqrt{n}$ for $\beta \le 1$, and the other by a rate of $n^{\beta/(1+\beta)}$ for $\beta > 1$. Both of these regimes are related to the arm selection tradeoff. The first regime corresponds to easy problems where the mean reservoir distribution puts a high mass close to $\mu^*$, which favors sampling a good arm with high mean from the reservoir. In this regime, the $\sqrt{n}$ rate comes from the parametric $1/\sqrt{n}$ rate for estimating the mean of any arm with $n$ samples. The second regime corresponds to more difficult problems where the reservoir is unlikely to output a distribution with mean close to $\mu^*$ and where one has to sample many arms from the reservoir. In this case, the $\sqrt{n}$ rate is not reachable anymore because there are too many arms to choose from sub-samples of arms containing good arms. The same dynamics exists also for the simple regret, where there are again two regimes, one characterized by a $n^{-1/2}$ rate for $\beta < 2$, and the other characterized by a $n^{-1/\beta}$ rate for $\beta \ge 2$. Provided that these bounds are tight (which is the case, up to a polylog $n$, Section 3.2), one can see that there is an interesting difference between the cumulative regret problem and the simple regret one. Indeed, the change of regime is here for $\beta = 2$ and not for $\beta = 1$, i.e., the parametric rate of $n^{-1/2}$ is valid for larger values of $\beta$ for the simple regret. This comes from the fact that for the simple regret objective, there is no exploitation phase and everything is about exploring. Therefore, an optimal strategy can spend more time exploring the set of arms and reach the parametric rate also in situations where the cumulative regret does not correspond to the parametric rate. This also has practical implications, examined empirically in Section 5.
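The two changes of regime discussed in Remark 1 can be made concrete by tabulating the exponents of $n$ in each rate while ignoring constants and logarithmic factors. The sketch below uses hypothetical helper names and simply encodes the exponent of rate (2) and of the simple regret rate stated in the introduction.

```python
def cumulative_regret_exponent(beta):
    """Exponent a in the cumulative regret rate n**a of (2), logs ignored."""
    return max(0.5, beta / (1.0 + beta))

def simple_regret_exponent(beta):
    """Exponent a in the simple regret rate n**a, i.e. max(n**-1/2, n**(-1/beta))."""
    return max(-0.5, -1.0 / beta)

if __name__ == "__main__":
    for beta in (0.5, 1.0, 1.5, 2.0, 3.0, 4.0):
        print(beta, cumulative_regret_exponent(beta), simple_regret_exponent(beta))
```

For $\beta = 1.5$, for example, the cumulative regret is already out of the parametric regime (exponent $0.6 > 0.5$) while the simple regret is still at the parametric exponent $-0.5$.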

3.2. SiRI and its upper bounds 

In this section, we present our algorithm, the Simple Regret 
for Infinitely many arms (SiRI) and its analysis. 

The SiRI algorithm Let $b = \min(\beta, 2)$, and let

$\bar{T}_\beta = \lceil A(n)\, n^{b/2} \rceil$,

where

$A(n) = A$ if $\beta < 2$,
$A(n) = A / \log(n)^2$ if $\beta = 2$,
$A(n) = A / \log(n)$ if $\beta > 2$,

where $A$ is a small constant whose precise value will depend on our analysis. Let $\log_2$ be the logarithm in base 2. Let us define

$t_\beta = \lfloor \log_2(\bar{T}_\beta) \rfloor$.

Let $T_{k,t}$ be the number of pulls of arm $k \le K_t$, and $X_{k,u}$ the $u$-th sample of $\nu_k$. The empirical mean of the samples of arm $k$ is defined as

$\hat{\mu}_{k,t} = \frac{1}{T_{k,t}} \sum_{u=1}^{T_{k,t}} X_{k,u}$.

With this notation, we provide SiRI as Algorithm 1.

Algorithm 1 SiRI
Simple Regret for Infinitely Many Armed Bandits
  Parameters: $\beta$, $C$, $\delta$
  Initial pull of arms from the reservoir:
    Choose $\bar{T}_\beta$ arms from the reservoir $\tilde{\mathcal{L}}$.
    Pull each of the $\bar{T}_\beta$ arms once.
    $t \leftarrow \bar{T}_\beta$
  Choice between these arms:
  while $t < n$ do
    For any $k \le \bar{T}_\beta$:
      $B_{k,t} \leftarrow \hat{\mu}_{k,t} + 2\sqrt{\frac{C}{T_{k,t}} \log\left(2^{2 t_\beta / b} / (T_{k,t}\delta)\right)} + \frac{2C}{T_{k,t}} \log\left(2^{2 t_\beta / b} / (T_{k,t}\delta)\right)$   (4)
    Pull $T_{k,t}$ times the arm $k_t$ that maximizes $B_{k,t}$ and receive $T_{k,t}$ samples from it.
    $t \leftarrow t + T_{k,t}$
  end while
  Output: Return the most pulled arm $\hat{k}$.

Discussion SiRI is a UCB-based algorithm, where the leading confidence term is of order

$\sqrt{\frac{\log(n / (\delta T_{k,t}))}{T_{k,t}}}$.

Similar to the MOSS algorithm (Audibert & Bubeck, 2009), we divide the $\log(\cdot)$ term by $T_{k,t}$, in order to avoid additional logarithmic factors in the bound. But a simpler algorithm with a confidence term as in a classic UCB algorithm for cumulative regret,

$\sqrt{\frac{\log(n/\delta)}{T_{k,t}}}$,

would provide almost optimal regret, up to a $\log n$ factor, i.e., with a slightly worse regret than what we get. It is quite interesting that with such a confidence term, SiRI is optimal for minimizing the simple regret for infinitely many armed bandits, since MOSS, as well as the classic UCB algorithm, targets the cumulative regret. The main difference between our strategy and the cumulative strategies (Berry et al., 1997; Wang et al., 2008; Bonald & Proutiere, 2013) is in the number of arms sampled from the arm reservoir: For the simple regret, we need to sample more arms. Although the algorithms are related, their analyses are quite different: Our proof is event-based whereas the proof for the cumulative regret targets directly the expectations.

It is also interesting to compare SiRI with existing algorithms targeting the simple regret for finitely many arms, such as the ones by Audibert et al. (2010). SiRI can be related to their UCB-E with a specific confidence term and a specific choice of the number of arms selected. Consequently, the two algorithms are related, but the regret bounds obtained for UCB-E are not informative when there are infinitely many arms. Indeed, the theoretical performance of UCB-E is decreasing with the sum of the inverse of the gaps squared, which is infinite when there are infinitely many arms. In order to obtain a useful bound in this case, we need to consider a more refined analysis, which is the one that leads to Theorem 2.

Remark 2. Note that SiRI pulls series of samples from the same arm without updating the estimate, which may seem wasteful. In fact, it is possible to update the estimates after each pull. On the other hand, SiRI is already minimax optimal, so one can only hope to get improvement in constants. Therefore, we present this version of SiRI, since its analysis is easier to follow.
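The procedure of Algorithm 1 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the reservoir interface (`draw_arm` returns a zero-argument callable producing one reward), the default `A = 1`, the truncation of the last batch to the remaining budget, and the clipping of the logarithm at zero (to keep the square root defined once $T_{k,t}\delta$ exceeds $2^{2t_\beta/b}$) are all assumptions made here, and the index uses the constants of (4) as read from the algorithm box.

```python
import math
import numpy as np

def siri_num_arms(n, beta, A=1.0):
    """Number of reservoir arms, ceil(A(n) * n^(b/2)), capped to [1, n]; needs n >= 2."""
    b = min(beta, 2.0)
    if beta < 2:
        a_n = A
    elif beta == 2:
        a_n = A / math.log(n) ** 2
    else:
        a_n = A / math.log(n)
    return max(1, min(n, math.ceil(a_n * n ** (b / 2))))

def siri(draw_arm, n, beta, C=1.0, delta=0.1, A=1.0):
    """Sketch of SiRI; returns (index of most pulled arm, pull counts)."""
    b = min(beta, 2.0)
    T_bar = siri_num_arms(n, beta, A)
    t_beta = math.floor(math.log2(T_bar))
    arms = [draw_arm() for _ in range(T_bar)]
    pulls = np.ones(T_bar)                               # each arm is pulled once
    sums = np.array([arm() for arm in arms], dtype=float)
    t = T_bar
    while t < n:
        # log(2^(2 t_beta / b) / (T_k delta)), clipped at 0 (assumption).
        log_term = (2 * t_beta / b) * math.log(2) - np.log(pulls * delta)
        log_term = np.maximum(log_term, 0.0)
        index = sums / pulls + 2 * np.sqrt(C * log_term / pulls) + 2 * C * log_term / pulls
        k = int(np.argmax(index))
        m = int(min(pulls[k], n - t))                    # pull T_k more times, within budget
        sums[k] += sum(arms[k]() for _ in range(m))
        pulls[k] += m
        t += m
    return int(np.argmax(pulls)), pulls

# Hypothetical reservoir: Bernoulli arms whose means follow Beta(1, 1) (uniform, beta = 1).
rng = np.random.default_rng(0)
means = []

def draw_arm():
    mu = rng.beta(1.0, 1.0)
    means.append(mu)
    return lambda: float(rng.random() < mu)

best, pulls = siri(draw_arm, n=2000, beta=1.0)
print("arms drawn:", len(means), "simple regret:", 1.0 - means[best])
```

Because each selected arm has its pull count doubled, the estimates are refreshed only between batches, which matches the batch behavior discussed in Remark 2.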

Main result We now state the main result, which characterizes SiRI's simple regret according to $\beta$.

Theorem 2 (Upper bounds). Let $\delta > 0$. Assume Assumptions 1 and 2 of the model and that $n$ is larger than a large constant that depends on $\beta, E, E', B, C$. Depending on the value of $\beta$, we have the following results, where $E$ is a large enough constant.

• Case $\beta < 2$: With probability larger than $1 - \delta$,

  $r_n \le E\, n^{-1/2} \log(1/\delta) \left(\log(\log(1/\delta))\right)^{96} \sim n^{-1/2}$.

• Case $\beta > 2$: With probability larger than $1 - \delta$,

  $r_n \le E\, (n \log(n))^{-1/\beta} \left(\log(\log(\log(n)/\delta))\right)^{96} \log(\log(n)/\delta) \sim (n \log n)^{-1/\beta} \,\mathrm{polyloglog}\, n$.

• Case $\beta = 2$: With probability larger than $1 - \delta$,

  $r_n \le E \log(n)\, n^{-1/2} \left(\log(\log(\log(n)/\delta))\right)^{96} \log(\log(n)/\delta) \sim n^{-1/2} \log n \,\mathrm{polyloglog}\, n$.


Short proof sketch. In order to prove the results, the main tools are events $\xi_1$ and $\xi_2$ (Appendix B). One event controls the number of arms at a given distance from $\mu^*$ and the other one controls the distance between the empirical means and the true means of the arms.

Provided that events $\xi_1$ and $\xi_2$ hold, which they do with high probability, we know that there are less than approximately $N_u = \bar{T}_\beta 2^{-u}$ arms at a distance larger than $2^{-u/\beta}$ from $\mu^*$, and that each arm that is at a distance larger than $2^{-u/\beta}$ from $\mu^*$ will be pulled less than $P_u = 2^{2u/\beta}$ times. After these many pulls, the algorithm recognizes that it is suboptimal.

Since a simple computation yields

$\sum_{0 \le u \le \log_2(\bar{T}_\beta)} N_u P_u \le n$,

we know that all the suboptimal arms at a distance further than $2^{-\log_2(\bar{T}_\beta)/\beta}$ from the optimal arm are discarded, since they are all sampled enough to be proved suboptimal. We thus know that an arm at a distance less than $2^{-\log_2(\bar{T}_\beta)/\beta}$ from the optimal arm is selected with high probability, which concludes the proof.

The full proof (Appendix B) is quite technical, since it uses a peeling argument to correctly define the high probability event to avoid a suboptimal rate, in particular in terms of $\log n$ terms for $\beta < 2$, and since we need to control accurately the number of arms at a given distance from $\mu^*$ at the same time as their empirical means. □
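The budget accounting behind the sketch can be checked numerically. Assuming, as in the sketch, that roughly $\bar{T}_\beta 2^{-u}$ arms sit at scale $2^{-u/\beta}$ and that each needs about $2^{2u/\beta}$ pulls to be ruled out, the total is of order $n$ when $\bar{T}_\beta$ is chosen as in SiRI. The function name and the choice $A = 1$ below are illustrative; with $A = 1$ the total is a constant multiple of $n$, which is why $A$ is taken small in the analysis.

```python
import math

def pull_budget(T_bar, beta):
    """Sum over scales u of (#arms at scale u) * (pulls needed per arm at scale u)."""
    return sum(
        (T_bar * 2.0 ** (-u)) * 2.0 ** (2 * u / beta)
        for u in range(int(math.log2(T_bar)) + 1)
    )

if __name__ == "__main__":
    n = 10 ** 4
    # beta = 1 (< 2): T_bar = n^(beta/2) = 100 with A = 1.
    print(pull_budget(100, 1.0) / n)
    # beta = 3 (> 2): T_bar = ceil(n / log n) with A = 1.
    print(pull_budget(math.ceil(n / math.log(n)), 3.0) / n)
```

In both regimes the ratio to $n$ stays bounded by a small constant, consistent with the claim that all clearly suboptimal arms can be sampled enough within the horizon.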

Discussion The bound we obtain is minimax optimal for $\beta < 2$ without additional $\log n$ factors. We emphasize it since the previous results on infinitely many armed bandits give results which are optimal up to a polylog $n$ factor for the cumulative regret, except the one by Bonald & Proutiere (2013), which considers a very specific and fully parametric setting. For $\beta \ge 2$, our result is optimal up to a polylog $n$ factor. We conjecture that the lower bound of Theorem 1 for $\beta \ge 2$ can be improved to $(\log(n)/n)^{1/\beta}$ and that SiRI is actually optimal up to a polyloglog$(n)$ factor for $\beta > 2$.

4. Extensions of SiRI 

We now briefly discuss three extensions of the SiRI algorithm that are very relevant either for practical or computational reasons, or for a comparison with the prior results. In particular, we consider the cases 1) when $\beta$ is unknown, 2) in a natural setting where the near-optimal arms have a small variance, and 3) in the case of unknown time horizon. These extensions all follow, in some sense, from our results and from the existing literature, and we will therefore state them as corollaries.
Algorithm 2 Bernstein-SiRI
Parameters: $C$, $\beta$, $\delta$
Newly defined quantities:

Set the number of arms as 

fp = |’min(n/log(n), , 

Modify the SiRI algorithm’s UCB (4) with 

Bk,t l^k,t + 2crfe^t4 /-=— log (TkpS)^ 

y Jfc,i 

+ ^log 

where $\hat{\sigma}^2_{k,t}$ is the empirical variance, defined as

$\hat{\sigma}^2_{k,t} = \frac{1}{T_{k,t}} \sum_{u=1}^{T_{k,t}} \left(X_{k,u} - \hat{\mu}_{k,t}\right)^2$.

Call SiRI: 

Run SiRI on the samples using these new parameters 


4.1. Case of distributions on [0,1] with $\mu^* = 1$

The first extension concerns the specific setting, particularly highlighted by Bonald & Proutiere (2013) but also presented by Berry et al. (1997) and Wang et al. (2008), where the domain of the distributions of the arms is included in $[0,1]$ and where $\mu^* = 1$. In this case, the information theoretic complexity of the problem is smaller than the one of the general problem stated in Theorem 1. Specifically, the variance of the near-optimal arms is very small, i.e., of the order of $\varepsilon$ for an $\varepsilon$-optimal arm. This implies a better bound, in particular, that the parametric limitation of $1/\sqrt{n}$ can be circumvented. In order to prove it, the simplest way is to modify SiRI into Bernstein-SiRI, displayed in Algorithm 2. It is an Empirical Bernstein-modified SiRI algorithm that accommodates the situation of distributions of support included in $[0,1]$ with $\mu^* = 1$. Note that in the general case, it would provide results similar to those provided in Theorem 2.
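The claim that an $\varepsilon$-optimal arm has variance of order $\varepsilon$ in this setting follows from a one-line computation, worked out here for completeness. Let $X \in [0,1]$ be a reward with mean $\mu = 1 - \varepsilon$; the symbol $X$ is introduced only for this derivation.

```latex
% Since 0 <= X <= 1, we have X^2 <= X, hence E[X^2] <= E[X] = \mu.
\operatorname{Var}(X)
  = \mathbb{E}[X^2] - \mu^2
  \le \mu - \mu^2
  = \mu (1 - \mu)
  = (1 - \varepsilon)\,\varepsilon
  \le \varepsilon .
```

For a Bernoulli arm the first inequality is an equality, so the variance is exactly $(1-\varepsilon)\varepsilon$.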

A similar idea was already introduced by Wang et al. (2008) in the infinitely many armed setting for cumulative regret. The idea is that the confidence term is more refined using the empirical variance and hence it will be very large for a near-optimal arm, thereby enhancing exploration. Plugging this term in the proof, conditioning on the event of high probability such that $\hat{\sigma}^2_{k,t}$ is close to the true variance, and using similar ideas as Wang et al. (2008), we can immediately deduce the following corollary.

Corollary 1. Let $\delta > 0$. Assume Assumptions 1 and 2 of the model and that $n$ is larger than a large constant that depends on $\beta, E, E', B, C$. Furthermore, assume that all the arms have distributions of support included in $[0,1]$ and that $\mu^* = 1$. Depending on $\beta$, we have the following results for Bernstein-SiRI.

• Case $\beta < 1$: The order of the simple regret is with high probability

  $r_n = O\left(\tfrac{1}{n} \,\mathrm{polylog}\, n\right)$.

• Case $\beta \ge 1$: The order of the simple regret is with high probability

  $r_n = O\left(\left(\tfrac{1}{n}\right)^{1/\beta} \mathrm{polylog}\, n\right)$.

Moreover, the rate

$\max\left(\tfrac{1}{n}, \; n^{-1/\beta}\right)$

is minimax-optimal for this problem, i.e., there exists no algorithm that achieves a better simple regret in a minimax sense.

The proof follows immediately from the proof of Theorem 2 using the empirical Bernstein bound as by Wang et al. (2008). Moreover, the lower bounds' rates follow directly from two facts: 1) $1/n$ is clearly a lower bound, and therefore optimal for $\beta < 1$, since it takes at least $n$ samples of a Bernoulli arm that is a constant times $1/n$ suboptimal in order to discover that it is not optimal, and 2) $n^{-1/\beta}$ can be trivially deduced from Theorem 1.¹ Bernstein-SiRI is thus minimax optimal for $\beta \ge 1$ up to a polylog $n$ factor.

Discussion Corollary 1 improves the results of Theorem 2 when $\beta \in (0, 2)$. For these $\beta$, it is possible to beat the parametric rate of $1/\sqrt{n}$, since in this case the variance of the arms decays with the quality of the arms. In this situation, for $\beta < 2$, it is possible to keep the rate of $n^{-1/\beta}$ until $\beta \le 1$, where the limiting rate of $1/n$ imposes its limitations: the regret cannot be smaller than the second order parametric rate of $1/n$. Here, the change point of regime is $\beta = 1$, which differs from the general simple regret case but is the same as the general case of cumulative regret as discussed in Remark 1. Notice that this comes from the fact that the limiting rate is now $1/n$, and not from the same reasons as for the cumulative regret.

¹Indeed, its proof shows that a lower bound of the order of $n^{-1/\beta}$ is valid for any distribution and in particular for Bernoulli with mean $\mu$ and $\mu^* = 1$, which is a special case of distributions of support included in $[0,1]$ with $\mu^* = 1$.
4.2. Dealing with unknown $\beta$

In practice, the parameter $\beta$ is almost never available. Yet its knowledge is crucial for the implementation of SiRI, as well as for all the cumulative regret strategies described in (Berry et al., 1997; Wang et al., 2008; Bonald & Proutiere, 2013). Consequently, a very important question is whether it is possible to estimate it well enough to obtain good results, which we answer in the affirmative.

An interesting remark is that Assumption 2 is actually related to assuming that the distribution function of $\mathcal{L}$ is $\beta$-regularly varying in $\mu^*$. Therefore, $\beta$ is the tail index of the distribution function of $\mathcal{L}$ and can be estimated with tools from extreme value theory (de Haan & Ferreira, 2006). Many estimators exist for estimating this tail index $\beta$, for instance the popular Hill's estimate (Hill, 1975), but also Pickands' estimate (Pickands, 1975) and others.


However, our situation is slightly different from the one where the convergence of these estimators is proved, as the means of the arms are not directly observed. As a result, we propose another estimate, related to the estimate of Carpentier & Kim (2014), which accommodates our setting. Assume that we have observed $N$ arms, and that all of these arms have been sampled $N$ times. Let us write $\hat{m}_k$ for the empirical mean estimates of the means $m_k$ of these $N$ arms and define

$\hat{m}^* = \max_k \hat{m}_k$.

We further define

$\hat{p} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{1}\{\hat{m}^* - \hat{m}_k \le N^{-\varepsilon}\}$

and set

$\hat{\beta} = -\frac{\log \hat{p}}{\varepsilon \log N}$. (5)

This estimate satisfies the following weak concentration inequality, whose proof is in Appendix D.

Lemma 1. Let $\underline{\beta}$ be a lower bound on $\beta$. If Assumptions 1 and 2 are satisfied and $\varepsilon < \min(\underline{\beta}, 1/2, 1/\beta)$, then with probability larger than $1 - \delta$, for $N$ larger than a constant that depends only on $B$ of Assumption 2,


\d-i3\< 


^ + max(l, log(.E'), | \og{E)\) 

elogN 


^ c'max(yiog(l/i5),() 

“ e log N 

where $c' > 0$ is a constant that depends only on $\varepsilon$ and the parameter $C$ of Assumption 1.
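The estimate (5) is straightforward to compute from a vector of empirical means. The sketch below is an illustration under simplifying assumptions: the function name is hypothetical, and the usage feeds it noise-free means placed on the quantiles of a uniform and of a Beta(1, 2) reservoir, so that the counting step can be followed by hand. With real data the means are noisy and the convergence is only of order $1/\log N$, so estimates are far less sharp.

```python
import math
import numpy as np

def estimate_beta(emp_means, eps):
    """Estimate (5): beta_hat = -log(p_hat) / (eps * log N), N = number of arms."""
    emp_means = np.asarray(emp_means, dtype=float)
    N = emp_means.size
    m_star = emp_means.max()
    # Fraction of arms whose empirical mean is within N^(-eps) of the best one.
    p_hat = np.mean(m_star - emp_means <= N ** (-eps))
    return -math.log(p_hat) / (eps * math.log(N))

# Noise-free quantile grids (assumption made only for this illustration).
grid = np.linspace(0.0, 1.0, 10_000)
uniform_means = grid                          # uniform reservoir, beta = 1
beta12_means = 1.0 - np.sqrt(1.0 - grid)      # Beta(1, 2) reservoir, beta = 2

print(estimate_beta(uniform_means, eps=0.25))  # close to 1
print(estimate_beta(beta12_means, eps=0.25))   # close to 2
```

Note that $\hat{p} \ge 1/N$ always holds because the best arm counts itself, so the logarithm is well defined.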


Let us now modify SiRI as in Algorithm 3. The knowledge of $\beta$ is no longer required, and one just needs a lower bound $\underline{\beta}$ on $\beta$. We get $\bar{\beta}$-SiRI, which satisfies the following corollary.


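Algorithm 3 $\bar{\beta}$-SiRI: $\bar{\beta}$-modified SiRI for unknown $\beta$
Parameters: $C$, $\delta$, $\underline{\beta}$
Initial phase for estimating $\beta$:

Let $N \leftarrow n^{1/4}$ and $\varepsilon \leftarrow 1/\log\log\log(n)$.
Sample $N$ arms from the arm reservoir $N$ times.
Compute $\hat{\beta}$ following (5).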

Set 


P <r- P + 


c' max (^V'log(l/i5),5 logloglogn 


logn 


( 6 ) 


Call SiRI:

Run SiRI using $\bar{\beta}$ instead of $\beta$ with $n - N^2 = n - \sqrt{n}$ remaining samples.


Corollary 2. Let Assumptions 1 and 2 be satisfied. If $n$ is large enough with respect to a constant that depends on $\beta, E, E', B, C$, then $\bar{\beta}$-SiRI satisfies the following:

• Case $\beta < 2$: The order of the simple regret is with high probability

  $r_n = O\left(\tfrac{1}{\sqrt{n}} \,\mathrm{polyloglog}\, n\right)$.

• Case $\beta > 2$: The order of the simple regret is with high probability

  $r_n = O\left((n \log n)^{-1/\beta} \,\mathrm{polyloglog}\, n\right)$.

• Case $\beta = 2$: The order of the simple regret is with high probability

  $r_n = O\left(\tfrac{\log n}{\sqrt{n}} \,\mathrm{polyloglog}\, n\right)$.

The proof can be deduced easily from Theorem 2 using the result from Lemma 1, noting that a $1/\log n$ rate in learning $\beta$ is fast enough to guarantee that all bounds will only be modified by a constant factor when we use $\bar{\beta}$ instead of $\beta$ in the exponent.

Discussion Corollary 2 implies that even in situations with unknown $\beta$, it is possible to estimate it accurately enough so that the modified $\bar{\beta}$-SiRI remains minimax-optimal up to a polylog $n$, by only using a lower bound $\underline{\beta}$ on $\beta$. This is the same guarantee that holds for SiRI with known $\beta$. We would like to emphasize that the estimate $\bar{\beta}$ (6) of $\beta$ can be used to improve cumulative regret algorithms that need $\beta$, such as the ones by Berry et al. (1997) and Wang et al. (2008). Similarly for these algorithms, one should spend a preliminary phase of $N^2 = \sqrt{n}$ rounds to estimate $\beta$ and then run the algorithm of choice. This will modify the cumulative regret rates in the general setting by only a polyloglog $n$ factor, which suggests that our $\beta$ estimation can be useful beyond the scope of this paper. For instance, consider the cumulative regret rate of UCB-F by Wang et al. (2008). If UCB-F uses our estimate of $\beta$ instead of the true $\beta$, it would still satisfy

$\mathbb{E}[R_n] = O\left(\max\left(n^{\beta/(\beta+1)} \,\mathrm{polylog}\, n, \; \sqrt{n} \,\mathrm{polylog}\, n\right)\right)$.



Figure 1. Uniform and B(l, 2) reservoir distribution 


Finally, this modification can be used to prove that this problem is learnable over all mean reservoir distributions with $\beta > 0$: This can be seen by setting the lower bound on $\beta$ as $\underline{\beta} = 1/\log\log\log N$, which goes to 0 but very slowly with $n$. In this case, we only lose a $\log\log(n)$ factor.

4.3. Anytime algorithm 

Another interesting question is whether it is possible to 
make SiRI anytime. This question can be quickly answered 
in the affirmative. First, we can easily just use a doubling 
trick to double the size of the sample in each period and 
throw away the preliminary samples that were used in the 
previous period. Second, Wang et al. (2008) propose a 
more refined way to deal with an unknown time horizon 
(UCB-AIR), that also directly applies to SiRI. Using these 
modifications it is straightforward to transform SiRI into an 
anytime algorithm. The simple regret in this anytime 
setting will only be worsened by a polylog n factor, where 
n is the unknown horizon. Specifically, in the anytime 
setting, SiRI modified either using the doubling trick or 
by the construction of UCB-AIR has a simple regret that 
satisfies, with high probability, 

$r_n = O\left(\mathrm{polylog}(n)\,\max\left(n^{-1/2},\ n^{-1/\beta}\,\mathrm{polylog}\,n\right)\right)$. 
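
The doubling trick mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `fixed_horizon_algo` is a hypothetical callable standing in for any fixed-horizon routine (such as SiRI run with budget n), and the helper names are not from the paper.

```python
def doubling_schedule(total_budget):
    """Split rounds 0..total_budget-1 into periods of sizes 1, 2, 4, ...

    Returns a list of (start, length) pairs; the last period is truncated
    when the (unknown to the learner) horizon is reached.
    """
    periods = []
    start, length = 0, 1
    while start < total_budget:
        periods.append((start, min(length, total_budget - start)))
        start += length
        length *= 2
    return periods


def run_anytime(fixed_horizon_algo, total_budget):
    """Restart a fixed-horizon routine on each period, discarding old samples.

    `fixed_horizon_algo(n)` is a hypothetical callable that spends n pulls
    and returns a recommended arm; only the latest recommendation is kept.
    """
    recommendation = None
    for _, length in doubling_schedule(total_budget):
        recommendation = fixed_horizon_algo(length)
    return recommendation
```

For example, `doubling_schedule(10)` yields periods of lengths 1, 2, 4, and 3. Since the last full period always contains a constant fraction of the elapsed rounds, restarting costs only a constant factor in the effective budget.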

5. Numerical simulations 

To simulate different regimes of the performance according 
to the $\beta$-regularity, we consider different reservoir 
distributions of the arms. In particular, we consider beta 
distributions B(x, y) with x = 1 and y = $\beta$. For 
B(1, $\beta$), Assumption 2 is satisfied precisely with 
regularity $\beta$. Since, to the best of our knowledge, SiRI 
is the first algorithm optimizing simple regret in the 
infinitely many arms setting, there is no natural 
competitor for it. Nonetheless, in our experiments we 
compare to algorithms designed for related settings. 
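
The statement that B(1, $\beta$) has regularity exactly $\beta$ can be checked numerically: for this reservoir the largest possible mean is 1, and the probability that an arm has mean above 1 - eps equals eps^beta. The sketch below is illustrative only; the function name, sample size, and seed are assumptions and not part of the paper's experiments.

```python
import random


def near_optimal_fraction(beta, eps, num_arms=100_000, seed=0):
    """Empirical fraction of arms drawn from B(1, beta) with mean above 1 - eps."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(num_arms)
        if rng.betavariate(1.0, beta) > 1.0 - eps
    )
    return hits / num_arms


if __name__ == "__main__":
    eps = 0.2
    for beta in (1.0, 2.0, 3.0):
        # Second column: Monte Carlo estimate; third column: eps ** beta.
        print(beta, near_optimal_fraction(beta, eps), eps ** beta)
```

The empirical fraction should match eps^beta up to Monte Carlo error, so near-optimal arms become rarer as $\beta$ grows.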

The first such comparator is UCB-F (Wang et al., 2008), an 
algorithm that optimizes cumulative regret for this setting. 
UCB-F is designed for a fixed horizon of n evaluations and 
it is an extension of a version of UCB-V by Audibert et al. 
(2007). Second, we compare SiRI to lil’UCB (Jamieson 
et al., 2014), designed for best-arm identification in the 
fixed confidence setting. The purpose of the comparison 
with lil’UCB is to show that SiRI performs on par with 
lil’UCB equipped with the optimal number $T_\beta$ of arms. 
In all our experiments, we set the constant A of SiRI to 
0.3, the constant C to 1, and the confidence $\delta$ to 0.01. 

Figure 2. Comparison on B(1, 3) and unknown $\beta$ on B(1, 1) 

All the experiments use a specific beta distribution as 
the reservoir, and the arm pulls are perturbed with N(0, 1) 
noise truncated to [0, 1]. We perform 3 experiments based 
on the different regimes of $\beta$ coming from our analysis: 
$\beta < 2$, $\beta = 2$, and $\beta > 2$. In the first experiment 
(Figure 1, left) we take $\beta = 1$, i.e., B(1, 1), which is 
just the uniform distribution. In the second experiment 
(Figure 1, right) we consider B(1, 2) as the reservoir. 
Finally, Figure 2 features the experiments for B(1, 3). 
The first obvious observation confirming the analysis is 
that a higher $\beta$ leads to a more difficult problem. 
Second, UCB-F performs well for $\beta = 1$, slightly worse 
for $\beta = 2$, and much worse for $\beta = 3$. This empirically 
confirms our discussion in Remark 1. Finally, SiRI 
performs empirically as well as lil’UCB equipped with the 
optimal number of arms and the same confidence $\delta$. 
Figure 2 also compares SiRI with $\hat\beta$-SiRI for the 
uniform distribution. For this experiment, using $\sqrt{n}$ 
samples just for the $\beta$ estimation did not decrease the 
budget too much and, at the same time, the estimated $\beta$ 
was precise enough not to hurt the final simple regret. 
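
The observation that a higher $\beta$ gives a harder problem can be illustrated on the reservoir alone. The sketch below assumes a B(1, beta) reservoir with largest mean 1 and ignores sampling noise entirely, so it only captures how far the best of a fixed number of drawn arms is from optimal. The function name is hypothetical.

```python
def median_best_gap(beta, num_arms):
    """Median gap (to mean 1) of the best arm among num_arms draws from B(1, beta).

    For B(1, beta), P(gap <= x) = x ** beta, hence
    P(min gap > x) = (1 - x ** beta) ** num_arms; setting this to 1/2
    and solving for x gives the median of the smallest gap.
    """
    return (1.0 - 0.5 ** (1.0 / num_arms)) ** (1.0 / beta)


if __name__ == "__main__":
    for beta in (1, 2, 3):
        print(beta, median_best_gap(beta, num_arms=1000))
```

With the number of drawn arms held fixed, the median gap of the best drawn arm increases with $\beta$, in line with the ordering observed in the experiments.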

Conclusion We presented SiRI, a minimax optimal 
algorithm for simple regret in the infinitely many armed 
bandit setting, which is of interest when we face an 
enormous number of potential actions. Both the lower and 
upper bounds give different regimes depending on a 
complexity parameter $\beta$, for which we also give an 
efficient estimation procedure. 
A. Additional notation 

We write $P_1$ for the probability with respect to the arm reservoir distribution, $P_2$ for the probability with respect to the distribution of the samples from the arms, and $P_{1,2}$ for the probability both with respect to the arm reservoir distribution and the distribution of the samples from the arms. 

Let F be the distribution function of the mean reservoir distribution $\mathcal{L}$. Let $F^{-1}$ be the pseudo-inverse of F. In order to express the regularity assumption, we define, for $u \in [0, 1]$, 

$G(u) = \mu^* - F^{-1}(1 - u)$. 

We assume that G has a certain regularity at its right end point, which is a standard assumption for infinitely many armed bandits. In particular, we rewrite Assumption 2 by only modifying the constants E, E', and B. 

Assumption 3 ($\beta$-regularity in $\mu^*$, version 2). Let $\beta > 0$. There exist $E, E', B \in (0, 1)$ such that $\forall u \in [0, B]$, 

$E' u^{1/\beta} \ge G(u) \ge E u^{1/\beta}$. 

This assumption is equivalent to Assumption 2, which is the same as the classic one (1), by definition of G and E, and we reformulate it for the convenience of the analysis. Without loss of generality, we assume that $\mu^* > 0$. 

B. Full proof of Theorem 2 

B.1. Roadmap 

The proof of Theorem 2 (upper bounds) is composed of two layers. The first layer consists of proving results on the empirical distribution of the arms emitted by the arm reservoir; the crucial object is the event $\xi_1$. The second layer consists of proving results on the random samples of the arms, and in particular that the empirical means of the arms are not too different from the true means of the arms. For this part, the crucial object is the event $\xi_2$. More precisely, these two layers can be decomposed as follows. 

• We prove suitable high-probability upper and lower bounds on the number of arms, among the $T_\beta$ arms pulled by the algorithm, that have a given gap (with respect to $\mu^*$), depending on the considered gap. This is done in Lemma 4. Two important results can then be deduced: (i) An upper bound on the number of suboptimal arms depending on how suboptimal they are. The more suboptimal they are, the more of them there are, in a way that depends on $\beta$. (ii) A proof that among the $T_\beta$ arms pulled by the algorithm, there is with high probability at least one arm, and not significantly more than one arm, that has a gap smaller than the simple regret from Theorem 2. This is done in Corollary 3. 

• In Lemma 5, we prove that with high probability, the empirical means of the arms are not too different from their true means. The main difficulty is that the means of the arms are random. In order to avoid a suboptimal log(n) dependency in the case $\beta < 2$, we use a peeling argument where the peeling is done over these random gaps, using the result from the previous layer, i.e., the bound on the number of arms with a given gap. 

Afterwards, we combine the two results to bound the number of suboptimal pulls (Section B.5). Since the algorithm pulls the arms depending on the empirical gaps, (i) the bounds on the number of suboptimal and near-optimal arms and (ii) the bounds on the deviations of the empirical means with respect to the true means allow us to obtain the desired bound on the number of pulls of suboptimal arms. By construction of the strategy, and in particular by the choice of $T_\beta$, we prove that with high probability the number of pulls of the suboptimal arms is smaller than a fraction of n. This means that there is a near-optimal arm that is pulled more than n/2 times. This is the one selected by the algorithm, which concludes the proof. 

B.2. Concentration inequalities 

We make several uses of Bernstein’s inequality: 

Lemma 2 (Bernstein’s inequality). Let $E(X_t) = 0$, $|X_t| \le b$ with $b > 0$, and $E(X_t^2) \le v$ with $v > 0$. Then for any $\delta > 0$, with probability at least $1 - \delta$, 

$\sum_{t=1}^{n} X_t \le \sqrt{2nv\log\delta^{-1}} + \frac{b}{3}\log\delta^{-1}$. 
Furthermore, Algorithm 2 and Corollary 1 are based on the empirical Bernstein concentration inequality. 

Lemma 3 (Empirical Bernstein’s inequality). Let $E(X_t) = 0$ and $|X_t| \le b$ with $b > 0$. For any $j = 1, \ldots, n$, let 

$V_j = \frac{1}{j}\sum_{t=1}^{j}\left(X_t - \frac{1}{j}\sum_{i=1}^{j} X_i\right)^2$. 

Then for any $\delta > 0$, with probability at least $1 - \delta$, 

$\sum_{t=1}^{n} X_t \le \sqrt{2nV_n\log(3\delta^{-1})} + 3b\log(3\delta^{-1})$. 
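
To give a feel for the size of these deviation terms, the sketch below evaluates a Bernstein-type radius and its empirical-variance counterpart. The exact constants (b/3 in the first, the factor 3 inside the logarithm and 3b in the second) follow the usual statements of these inequalities and are assumptions here; the function names are hypothetical.

```python
import math


def bernstein_radius(n, v, b, delta):
    """Bernstein-type bound on a sum of n centered variables.

    Computes sqrt(2 n v log(1/delta)) + (b/3) log(1/delta), where v bounds
    the second moments and b bounds the absolute values.
    """
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * n * v * log_term) + b * log_term / 3.0


def empirical_bernstein_radius(samples, b, delta):
    """Same shape of bound, with the empirical variance replacing v.

    Computes sqrt(2 n V log(3/delta)) + 3 b log(3/delta), where V is the
    (biased) empirical variance of the samples.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * n * variance * log_term) + 3.0 * b * log_term
```

When the empirical variance is small, the second term dominates and the radius is of order log(1/delta) rather than sqrt(n log(1/delta)), which is what makes variance-aware bounds attractive for near-optimal arms with small variance.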


B.3. Notation 

For any $i \le K_n$, set 

$\Delta_i = \mu^* - \mu_i$, 

where we remind the reader that $\mu_i$ is the mean of the distribution of arm i. 

Without loss of generality, we assume that $\mu^* > 0$. For any $u \in \mathbb{N}$, we define 

$I_u = [\mu^* - G(2^{-u}),\ \mu^* - G(2^{-u-1})]$. 


We also define 


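We further define 

$I^* = [\mu^* - G(2^{-t_\beta}),\ \mu^*] = [\mu^* - G(2^{-t_\beta}),\ \mu^* - G(0)]$. 

Let $N_u$ be the number of arms in segment $I_u$, 

$N_u = \sum_{k=1}^{T_\beta} 1\{\mu_k \in I_u\}$, 

and let $N^*$ be the number of arms in the segment $I^*$, 

$N^* = \sum_{k=1}^{T_\beta} 1\{\mu_k \in I^*\}$. 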
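
The dyadic segments above can be made concrete on a B(1, beta) reservoir. The sketch assumes that the largest mean is 1 and that G(u) = u^(1/beta), which holds for this particular reservoir; under that assumption an arm with gap d falls in segment u exactly when d^beta lies between 2^-(u+1) and 2^-u, so each arm lands in segment u with probability 2^-(u+1). Function and parameter names are hypothetical.

```python
import math
import random


def dyadic_counts(beta, num_arms, max_level, seed=0):
    """Count arms drawn from B(1, beta) whose gap falls in each dyadic segment.

    Returns a list whose entry u is the number of arms in segment u,
    for u = 0, ..., max_level.
    """
    rng = random.Random(seed)
    counts = [0] * (max_level + 1)
    for _ in range(num_arms):
        gap = 1.0 - rng.betavariate(1.0, beta)
        mass = gap ** beta  # uniform on [0, 1] for this reservoir
        if mass <= 0.0:
            continue  # a gap of exactly zero belongs to no finite level
        level = int(-math.log2(mass))  # floor, since -log2(mass) >= 0
        if level <= max_level:
            counts[level] += 1
    return counts


if __name__ == "__main__":
    total = 100_000
    for u, count in enumerate(dyadic_counts(2.0, total, max_level=5)):
        print(u, count, total * 2.0 ** -(u + 1))
```

The counts halve from one level to the next regardless of beta, which is why the segments are indexed by quantile levels rather than by gap values.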


B.4. Favorable high-probability event 

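Let $\xi_1$ be the event defined as 

$\xi_1 = \Big\{\omega : \forall u \in \mathbb{N},\ u \le t_\beta,\ |N_u - 2^{t_\beta-u-1}| \le \sqrt{(t_\beta-u+1)\,2^{t_\beta-u}\log(1/\delta)} + (t_\beta-u+1)\log(1/\delta),\ \text{and } N^* \le 1 + 2\sqrt{\log(1/\delta)} + 2\log(1/\delta)\Big\}$ 

$= \Big\{\omega : \forall u \in \mathbb{N},\ u \le t_\beta,\ |N_u - 2^{t_\beta-u-1}| \le 2^{t_\beta-u-1}\varepsilon_u \text{ and } N^* \le 1 + \varepsilon_{t_\beta}\Big\}$, 

where $\varepsilon_u = 2\sqrt{(t_\beta-u+1)\,2^{-(t_\beta-u)}\log(1/\delta)} + 2(t_\beta-u+1)\,2^{-(t_\beta-u)}\log(1/\delta)$. 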
Lemma 4. The probability of $\xi_1$ under both the distribution of the arm reservoir and the samples of the arms is larger than 
1 — ^1 + ^ fof S small enough, 

Pi(ei) = Pi,2(ei)>i - + 

Proof of Lemma 4. Let $u \in \mathbb{N}$. We have by definition that 

$N_u = \sum_{k=1}^{T_\beta} 1\{\mu_k \in I_u\}$ 

is a sum of independent Bernoulli random variables of parameter $2^{-u} - 2^{-u-1} = 2^{-u-1}$. By Bernstein’s inequality (Lemma 2) for sums of Bernoulli random variables, this implies that with probability $1 - \delta_u$, 

$|N_u - 2^{t_\beta-u-1}| \le \sqrt{2^{t_\beta-u}\log\delta_u^{-1}} + \log\delta_u^{-1}$. 

Set $\delta_u = \exp(-(t_\beta - u + 1))\,\delta$. Notice that for $u \le t_\beta$, $\log\delta_u^{-1} \le (t_\beta - u + 1)\log\delta^{-1}$. Then the result holds by a union bound since, for $\delta$ small enough, 

$\sum_{u=0}^{t_\beta}\delta_u = \sum_{u=0}^{t_\beta}\exp(-(t_\beta - u + 1))\,\delta \le \delta$, 

and by a similar argument for $N^*$, which together with another union bound gives the claim. □ 


The following corollary follows from Lemma 4. 


Corollary 3. Set $t^* = \lfloor t_\beta - 96\log_2(\log_2(\log(1/\delta))) - \log_2(\log(1/\delta))\rfloor - 2$. Let $\delta$ be smaller than a universal constant. 

If n is large enough so that $t^* \ge \log_2(1/B)$, then on $\xi_1$ there is at least one arm of index in $\{1, \ldots, T_\beta\}$ that belongs 

to $I_{t^*}$. If $k^*$ is its index, then 


Afc. < \E'(\og^(\og(\/5))f^2-~*^'P\og(l/5). 


Proof of Corollary 3. First we have for u < t* 

£u = 2^ (ip-u + l)2-(‘>-“) log(l/(5) + 2{ip - 
< 2 


1)2 


-itp-u) 


log(l/(5) 


(1 + log2(log(l/(5)) + 96log2(log2(l og(l/(5)))) 2(1 + log2( log(l/(5)) + 961og2(log2(log(l/5)))) 


961og2(log(l/5)) 961og2(log(l/(5)) 

< 4a/ 1/(96log2(log(l/5))) + 1/96 + log2(log2(log(l/5)))/ log2(log(l/(5)) 

< 1 / 2 , 


for $\delta$ being a small enough constant. 

This implies that for $u \le t^*$, 

$2^{t_\beta-u-1}(1 - \varepsilon_u) \ge 2^{t_\beta-u-1} \times 1/2 \ge 1$. 

This implies that on $\xi_1$, $N_{t^*} \ge 1$, which means there is at least one arm in $I_{t^*}$. Let us call $k^*$ one of these arms. By definition 
of $I_{t^*}$, it satisfies 

Afc. < G(2-‘*) < L;'2-‘*//3 < iL;'(log2(log(l/5)))962-G//5log(l/5). 
because of Assumption 3, since t* > log2(l/B). □ 


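For any $k \in \mathbb{N}$, $1 \le k \le T_\beta$, let 

$n_k = \left\lfloor \log_2\left(\frac{D\log\left(\max(1,\ 2^{2t_\beta/b}\Delta_k^2)/\delta\right)}{\max\left(2^{-2t_\beta/b},\ \Delta_k^2\right)}\right)\right\rfloor$, 

where D is a large constant, and 

$n_u = \left\lfloor \log_2\left(\frac{D\log\left(\max(1,\ 2^{2t_\beta/b}G(2^{-(u+1)})^2)/\delta\right)}{\max\left(2^{-2t_\beta/b},\ G(2^{-(u+1)})^2\right)}\right)\right\rfloor$. 

Let also 

$n_{-1} = \left\lfloor \log_2\left(\frac{D\log\left(\max(1,\ 2^{2t_\beta/b}G(B)^2)/\delta\right)}{\max\left(2^{-2t_\beta/b},\ G(B)^2\right)}\right)\right\rfloor$. 

Let 

$\xi_2 = \Big\{\omega : \forall k \in \mathbb{N}^*,\ k \le T_\beta,\ \forall v \le n_k,\ \Big|\frac{1}{2^v}\sum_{i=1}^{2^v} X_{k,i} - \mu_k\Big| \le 2\sqrt{C\,2^{-v}\log(2^{2t_\beta/b - v}/\delta)} + 2C\,2^{-v}\log(2^{2t_\beta/b - v}/\delta)\Big\}$. 
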

Lemma 5. Case /3 < 2: Knowing the probability 0/^2 is larger than 1 — H log(l/(5)^5, 

P 2 (C2ia)>i-i^iog(i/5)"<5, 

where H is a constant that depends only on _D, E, E', (3. 

Case P > 2: Knowing ^ 1 , the probability of ^2 is larger than 1 — H\og{l/5)^ log(n)(5, 

P 2 ( 6 IC 1 ) > 1 - ^^log(l/(5)^log(n)^(5, 

where H is a constant that depends only on _D, E, E', /3. 

Proof of Lemma 5. Let $(k, v) \in \mathbb{N}^* \times \mathbb{N}$. Since $(X_{k,i})_i$ are i.i.d. from a distribution bounded by C, we have that with 
probability (according to the samples) larger than $1 - \delta_{k,v}$, 


r>v 


Pk 


< ^2C2-Mog(l/4,.) + 2 C X 2 -" log (1/4.4 • 


Set 5k,v = We have 


E E 6k,V = 2-2*>/'’5 E E 2’' < 2 X 2“^‘'5/'’(5 2”'= < 2 X 2“^‘'5/'’(5 E iV„2" 

k<Tp v<nk k<ff3 y<nk k<Tp 


u—0 


since $2^{n_u}$ is increasing in u. 

Again, since $2^{n_u}$ is increasing in u, this implies that on $\xi_1$, 

00 

y] 4.. < 2 X iv„ 2 "“ 

k<Ta v<nk 


u—0 


t0 


< 2 X 2-2*'^/^^ Tp2^-^ + ^“2"" + 1V*2”*> 

\ M=Llog2(l/B)J+l 

“=Llog2(l/-S)J+l 

E 


(7) 


< 2 X 2"2*'3/^^ 


EB-‘^/P 
( 2^PDE' log 




- y] 2 *>^-i(i+£„)= 

“=Llog2(l/-B)J+l 


£’2-2(“-1)//3 


+ (l + e,4f?log(i)22‘>/'’ 


< 2 X 2"2*'5/^^ I EEE/Eil + ^6i:)/(£;6) log (f)^2*'5"“+2“/'^(24 - 2(u - 1)) + SDlog (i)^ 2^*''/'’ ] , 


E 


U—0 


Since 


£u < 4(4 — M + 1) log (y) and since b < 0, which implies that 2f^ — 2(u — 1) > 1. 
Case 1: $\beta < 2$: In this case, $b = \beta$. Since ^ < oo for any v,v' > 0 that on $\xi_1$, the last equation implies 

^ 4,. < 2 X 2-«./<>a [+ ?|£I (I) 22f./» + 5Dlog (P’ 2^‘’/A < F. log (J)’ a, 

k<T/3 v<nk \ ^ ) 

where F{, Fi > 0 are constants. 

Case 2: /? > 2: In this case, 6 = 2. Since < oo for any v > 0 that on ^i, the last equation implies 

y: y: < 2 X 2-«-.a (,og + 50log (J)^ 2<-.') 


k<T 3 v<nk 


. 2 - 


< F 2 log (j) tfjS < F 2 log (i) log(n)(5, 
where F 2 , F 2 > 0 are constants. 

Case 3: /3 = 2: In this case, we have on ^1 


i: i: ax,„ < 2 X 2-'»a ^ g log (f)" 2'»(2i, - 2(,. - 1)) + SDlog (1)“ 2'. 

k<Tfi v<nk \ u=0 ^ 


<F3log(i) t|6<F3log(|) log(n)^( 5 . 
where F 3 > 0 is a constant. 

Let $\xi = \xi_1 \cap \xi_2$. By Lemmas 4 and 5, we know that for a given constant $F_4$ that depends only on $\beta$, D, E, E', 
• Case P <2: 


• Case P >2: 


P(C)>l-L’4log(i) <5. 


P(0>l-F4log(i)'log(n)35. 


B.5. Upper bound on the number of pulls of the non-near-optimal arms 

Let k be an arm such that $k \le T_\beta$, and let $t \le n$ be a time. On the event $\xi$, by definition, we have 


|/4fc,£ k‘k\ Pi 


/Clog ( 22 C/ 7 (Tfe,t 6 )) 2 Clog (22C/7(rfe,7)) 


Tk,t 


T. 


k,t 


which implies by definition of the upper confidence bound that on ^ 

Mfc < Bk^t < fifc + £k,t, where ek,t = 2 


Clog (2%/V( rfc,7)) ^ 2Clog(2^C/V(Tfc,7)) 


Ti 


k.t 


f-k,t 


□ 


( 8 ) 


Let us now write $k^*$ for the best arm among the ones in $\{1, \ldots, T_\beta\}$. Note that $k^*$ may be different from the best possible 
arm. By Corollary 3, we know that on $\xi$, $k^*$ satisfies 

Afc* < (log 2 (log ( 1 / 6 )))'® 2 -‘A //5 log (i) = eL 
Arm k is pulled at time t instead of $k^*$ only if 

$B_{k,t} \ge B_{k^*,t}$. 
On $\xi$, this happens if 

$\mu^* - \varepsilon^* \le \mu_k + \varepsilon_{k,t}$, 

which happens if (on $\xi$) 

$\Delta_k - \varepsilon^* \le \varepsilon_{k,t}$, 

and if we assume that $\Delta_k \ge 2\varepsilon^*$, it implies that on $\xi$ arm k is pulled at time t only if 

$\Delta_k \le 2\varepsilon_{k,t}$. (9) 



We define u such that (i) $\mu_k \in I_u$ if $u \ge \lfloor\log_2(1/B)\rfloor + 1$ or (ii) $u = -1$ otherwise. Assume that 

^ of. .o Dlog(mii 3 c(l, 2 ”»''<'G( 2 -('‘+'l)’)/i) ^ Dlog(2”>/‘G(2-<“+i))f/i) 

- “ m«(2-»./‘,G(2-(-+«)") “ G(2 -(-+i))2 ' 

since we assumed that Ak > 2e*, which implies that tk^t ^ t* < ip. 

By Assumption 3, and since $\mu_k \in I_u$, we know that $G(2^{-(u+1)}) \le \Delta_k$. Therefore, the last equation implies 


D 


Tk.t > 2 "" > 


log(max(^l,220/&G(2 /,5) ^ D\og{2'^^f>/^Al/5) 


max( 2 - 20 /f',G( 2 -(“+i))' 




For such aTk.t,'^^ have 


log ( 2 ^ 0/7 {Tk,t5)) ^ A2 log (02^0/^A2/5) 


Tk, 


D\og (220/t'A2/,5) 


< 


Al log D 
D 


< A2/(16G), 


for D large enough so that $D \ge 32(C + 1)\log(32(C + 1))$. Therefore, by definition of $\varepsilon_{k,t}$, the last equation implies that 
for $T_{k,t} \ge 2^{n_u}$, we have 

$\varepsilon_{k,t} < \Delta_k/4$. 


The last equation implies together with (9) that if $T_{k,t} \ge 2^{n_u}$, then on $\xi$ arm k is not pulled from time t onwards. In 
particular, this implies that on $\xi$ 

$T_{k,n} \le 2^{n_u}$, 

for any $k \le T_\beta$ such that $\Delta_k \ge 2\varepsilon^*$, and such that (i) $\mu_k \in I_u$ if $u \ge \lfloor\log_2(1/B)\rfloor + 1$ or (ii) $u = -1$ otherwise. 

Let $\mathcal{A}$ be the set of arms such that $\Delta_k \ge 2\varepsilon^*$. From the previous equation, the number of times that they are pulled is 
bounded on $\xi$ as 


Tk,n < Y. < Tp2^-^ + Y 

k^A- u<i0 LI 0 S 2 (■®)J —^^3 

Bounding this quantity can be done in essentially the same way as in (7). We again obtain three cases. 


• Case 1: $\beta < 2$: In this case, on $\xi$, 


Vr ^ 2^^DEAogin/S) 

2_^ k,n — - 

keA 


3DF{ 


\og{E/5f2'^^^/i^ < n/H, 


where H is arbitrarily large for A small enough in the definition of $T_\beta$. 
• Case 2: $\beta > 2$: In this case, on $\xi$, 


, 2n(nm ^ 


where H is arbitrarily large for A small enough in the definition of $T_\beta$. 
• Case 3: $\beta = 2$: In this case, on $\xi$, 


E n.. < ^ g l„g(£;/i)»2'» (2fs - 2(„ - 1)) < n/H, 

kGA ^ 


u—0 


where H is arbitrarily large for A small enough in the definition of $T_\beta$. 


Consider now u* such that u* = [log2 (l/F(e*))J. By definition of e*, we know that on we have 

u* >t* - viS) 

Therefore, with high probability, by Lemma 4 and as in Corollary 3, on $\xi$ there are fewer than $N(\delta)$ arms of index smaller 
than $T_\beta$ such that $\Delta_k \le 2\varepsilon^*$, where $N(\delta)$ is a constant that depends only on $\delta$. For H large enough, on $\xi$, $N(\delta) < H$. This 
implies, together with the three cases, that there is at least one arm of index smaller than $T_\beta$ and such that $\Delta_k \le 2\varepsilon^*$ that 
is pulled more than n/H times. This implies that the most pulled arm is such that, on $\xi$, $\Delta_k \le 2\varepsilon^*$. This implies that the 
regret is on $\xi$ bounded as 

^ E'{logSog{ll5))f^ < E"n^E2p)^^Y/^{\og{\og{l/5))f^\og{\l5) 


Therefore, by Lemmas 4 and 5, the previous equation implies in the three cases, for some constants $E_4$, $E'''$: 

• Case /3 < 2; With probability larger than 1 — E^ log(l/(5)^(5, we have 

$r_n \le E''' n^{-1/2}(\log(\log(1/\delta)))^{96}\log(1/\delta)$, 
hence with probability larger than $1 - \delta$, 

$r_n \le E_4\, n^{-1/2}\log(1/\delta)(\log(\log(1/\delta)))^{96} \sim n^{-1/2}$. 

• Case /3 > 2; With probability larger than 1 — E^ log(l/(5)^ log(n)^i5, we have 

$r_n \le E'''(n\log(n))^{-1/\beta}(\log(\log(1/\delta)))^{96}\log(1/\delta)$, 
hence with probability larger than $1 - \delta$, 

$r_n \le E_4 (n\log(n))^{-1/\beta}(\log(\log(\log(n)/\delta)))^{96}\log(\log(n)/\delta) \sim (n\log(n))^{-1/\beta}\log\log(n)\,(\log\log\log(n))^{96}$. 

• Case j3 = 2: With probability larger than 1 — F4 log(l/(5)^ log(n)^i5, we have 

$r_n \le E'''\log(n)\,n^{-1/2}(\log(\log(1/\delta)))^{96}\log(1/\delta)$, 
hence with probability larger than $1 - \delta$, 

$r_n \le E_4\log(n)\,n^{-1/2}(\log(\log(\log(n)/\delta)))^{96}\log(\log(n)/\delta) \sim \log(n)\,n^{-1/2}\log\log(n)\,(\log\log\log(n))^{96}$. 
C. Full proof of Theorem 1 

C.1. Case $\beta < 2$ 

By Assumption 2 (equivalent to Assumption 3), we know that 

$E' u^{1/\beta} \ge G(u) \ge E u^{1/\beta}$. 


Assume that when pulling an arm from the reservoir, its distribution is Gaussian with variance 1 and with a mean following the distribution associated to G. Since the budget is bounded by n, an algorithm pulls at most n arms from the arm reservoir. 
Let us define 


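$I_1 = \left[\mu^* - E'\frac{c_1^{1/\beta}}{\sqrt{n}},\ \mu^* - E\frac{(c_1')^{1/\beta}}{\sqrt{n}}\right]$ and $I_2 = \left[\mu^* - E\frac{(c_1')^{1/\beta}}{\sqrt{n}},\ \mu^*\right]$, 

where $c_1, c_1'$ are constants. If we denote by $N_1$ the number of arms in $I_1$ and by $N_2$ the number of arms in $I_2$ among the n first 
arms pulled from the arm reservoir, we can use Bernstein’s inequality and, for n larger than a large enough constant, 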


Pi ^N 2 > (1 + log (1/^))^ < ^ and Pi (1 - log (1/(5))^ < (5. 

Consequently, for $c_1$ large enough when compared to $c_1'$, it implies that with probability larger than $1 - 2\delta$, we have that 
$N_1 \ge 1$ and $N_1 \ge N_2$. Consider the event $\xi$ of probability $1 - 2\delta$ where this is satisfied. 

On $\xi$, a problem that is strictly easier than the initial problem is the one where an oracle points out two arms to the learner, 
the best arm in $I_2$ and the worst arm in $I_1$, and where the objective is to distinguish between these two arms and output 
the arm in $I_2$. Indeed, this problem is on $\xi$ strictly easier than an intermediate problem where an oracle provides the set of 
arms in $I_1 \cup I_2$ and asks for an arm in $I_2$, since $N_1 \ge N_2$. On $\xi$, this intermediate problem is in turn strictly easier than 
the original problem of outputting an arm in $I_2$ without oracle knowledge. Therefore, for the purpose of constructing the 
lower bound, we now turn to the strictly easier problem of deciding between the arm $k^*$ with the highest mean in $I_2$ and 
the arm k with the lowest mean in $I_1$, and prove the lower bound for this strictly easier problem. 

Since the number of pulls on both $k^*$ and k is bounded by n, we use the chain rule and the fact that the distributions are 
Gaussian to get on $\xi$ 

$\mathrm{KL}(k, k^*) \le n(\mu - \mu^*)^2$, 

where $\mu$ is the mean of k and $\mu^*$ is the mean of $k^*$. Given $\xi$, let p be the probability that k is selected as the best arm, and 
$p^*$ the probability that $k^*$ is selected as the best arm. By Pinsker’s inequality, we know that on $\xi$ 

$|p - p^*| \le \sqrt{\mathrm{KL}(k, k^*)} \le \sqrt{n}\,|\mu - \mu^*| \le \sqrt{n}\,E'\frac{c_1^{1/\beta}}{\sqrt{n}} \le E' c_1^{1/\beta}$. 


Since there are only two arms in this simplified game, we know that on $\xi$ 

p* < 1/2 +< 7/12. 

for $c_1$ small enough. By definition of $\xi$ and since the problem we considered is easier than the initial problem, we know 
that for all algorithms, the probability $P^*$ of selecting an arm in $I_2$ is bounded as follows, where we add the probability that 
$\xi$ does not hold, 

$P^* \le 7/12 + 2\delta \le 2/3$, 

for $\delta$ small enough. This concludes the proof by definition of $I_2$. 
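
The two-arm reduction can be illustrated numerically. The sketch below uses the exact KL divergence between n i.i.d. unit-variance Gaussian samples, n (mu_a - mu_b)^2 / 2, and the standard form of Pinsker's inequality; both are tighter by constant factors than the bounds used in the proof. The gap constant 0.5 and the function names are illustrative assumptions.

```python
import math


def gaussian_kl_n_pulls(mu_a, mu_b, n, sigma=1.0):
    """KL divergence between n i.i.d. draws of N(mu_a, sigma^2) and N(mu_b, sigma^2)."""
    return n * (mu_a - mu_b) ** 2 / (2.0 * sigma ** 2)


def pinsker_bound(kl):
    """Upper bound on the total variation distance implied by a KL divergence."""
    return min(1.0, math.sqrt(kl / 2.0))


if __name__ == "__main__":
    n = 10_000
    gap = 0.5 / math.sqrt(n)  # gap of order 1/sqrt(n); 0.5 is an arbitrary constant
    kl = gaussian_kl_n_pulls(0.0, gap, n)
    print(kl, pinsker_bound(kl))
```

With these values the KL divergence is 0.125 and the total variation bound is 0.25, independently of n: a gap of order 1/sqrt(n) cannot be resolved reliably with n pulls, which is the mechanism behind the n^{-1/2} lower bound.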


C.2. Case $\beta > 2$ 

By Assumption 2 (equivalent to Assumption 3), we know that 


$E' u^{1/\beta} \ge G(u) \ge E u^{1/\beta}$. 
Assume that when pulling an arm from the reservoir, its distribution is Gaussian with variance 1 and with a mean following the distribution associated to G. The total number of arms pulled from the reservoir is at most n since the budget is 
bounded by n. Let 


h = 




where $c_0$ is a constant defined as a function of $\delta > 0$ such that, if we denote by $N_0$ the number of arms in $I_0$, we have 


Pi (TVo = 0) > (l - > exp(-co/2) > 1 - 5. 

Therefore, there are no arms in $I_0$ with probability larger than $1 - \delta$, and thus, with probability larger than $1 - \delta$, the 
regret of the algorithm is larger than 


D. Proof of Lemma 1 

By a union bound, we know that with probability larger than $1 - \delta$, for all $k \le N$, we have 

$|\hat m_k - m_k| \le c\sqrt{\frac{\log(N/\delta)}{N}}$. 


Note that by Assumption 2, we have that with probability larger than 1 — 5, 


p*| < c 



Let us write 


Vn 





Note first that with probability larger than 1 — (5 on the samples (not on ruk) 


1 X ^ 1 X ^ 

— l{fl* - TOfe < N~‘^ + vn}>p> - Wfc < - -uat}, 

/c=l /c=l 


We now define for I G {0,1} 

p+ = {p* -rn< N~^ -f Vn) and p~ = (p* - m < N~‘^ - vn) ■ 

Notice that for n larger than a constant that depends on B of Assumption 2, we have by Assumption 2 the following bound 
for * £ {+, —}, since (uatIV'^) < 1/25“^/^ as £ < min(/3,1/2), and also for N larger than a constant that depends on B 
only 


logjp*) 

elogN 

which implies that 


-/? 


^ {vnN^Y/P + max(l, log(i^'), | log(g)|) ^ 5 + max(l, log(i/'), | log(i/)|) 


£log 


e\ogN 


p* > 

where c' > 1/2 is a small constant that is larger than E/2 for n larger than a constant that depends only on B. 
By Hoeffding’s inequality applied to Bernoulli random variables, we have that with probability larger than $1 - \delta$ 


N 


— l{/i* -ruk < N ^ + uat} - p~' 


k=l 


< C 


l0g(l/5) def 
-^=WN, 
and the same for $p^-$ with $1\{\mu^* - m_k \le N^{-\varepsilon} - v_N\}$. All of this implies that with probability larger than $1 - 6\delta$ 

$p^+ + w_N \ge \hat p \ge p^- - w_N$, 

which implies that with probability larger than $1 - 6\delta$ 

log(p+ + Wn) ^ - ^ log(p~ - Wn) 
e log N —P— g. jQg jY ’ 

i.e., with probability larger than 1 — 65, since wn/p~ < l/2-\/log(l/5) as n is large enough (larger than a constant) and 
since (3 < l/(2e) 

log(p"^) _ 2wn ^ log(p) ^ log(p~) ‘^wn 

elogN p+\og{n)e ~ elogN ~ slogN p~e\ogN^ 

which implies the final claim 

log(p) _ o < 5~^/^/^ + ^log(l/5) + max(l, log(-E0, | log(ig)|) 

£ log N ~ e log N 
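
The quantity controlled in this proof suggests a simple plug-in estimator, sketched below. This is an assumption-laden illustration rather than the paper's estimate (6): it takes the form -log(p) / (eps log N) read off from the displays above, with p the fraction of the N arms whose estimated mean is within N^(-eps) of the best estimated mean. The function name, the noiseless test reservoir, and the choice eps = 0.25 are all hypothetical.

```python
import math
import random


def estimate_beta(mean_estimates, eps):
    """Plug-in estimate of the regularity beta from N estimated arm means.

    p_hat is the fraction of arms within N ** (-eps) of the best estimated
    mean; under beta-regularity p_hat is roughly N ** (-eps * beta).
    """
    n_arms = len(mean_estimates)
    best = max(mean_estimates)
    threshold = n_arms ** (-eps)
    close = sum(1 for m in mean_estimates if best - m <= threshold)
    p_hat = close / n_arms  # at least 1 / n_arms, so the log is finite
    return -math.log(p_hat) / (eps * math.log(n_arms))


if __name__ == "__main__":
    rng = random.Random(1)
    # Noiseless means from B(1, 2), standing in for accurate estimates.
    means = [rng.betavariate(1.0, 2.0) for _ in range(100_000)]
    print(estimate_beta(means, eps=0.25))
```

Because the error of such an estimator decays only like 1/log N, it is accurate enough for use in an exponent but should not be expected to recover beta to many digits.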




